Disclaimer for the general audience: immunology is deeply complex. Additionally, immunologists have made very little effort to make the language or concepts they use understandable to a non-expert. So if this sails past you, it's not your fault. It's likely a combination of the true complexity, flaws in my communication strategy, and rotting jargon in the field.
Protein language models are inspired by natural language processing (NLP) models like GPT-4. They are designed to analyze and predict protein sequences, which are chains of amino acids with distinct functions in living organisms. These models are trained on large datasets of known protein sequences and appear to learn general truths and patterns about protein space through relatively simple and unsupervised training procedures. By understanding these patterns, the models can predict protein structure and function. Similar to NLP models, protein language models use a technique called "attention" to weigh the importance of different amino acids in the sequence, allowing them to capture long-range interactions and dependencies between amino acids. I wanted to understand how these models can be applied to studying antibody evolution in particular.
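To make the "attention" idea concrete, here is a minimal sketch of scaled dot-product self-attention over a short amino acid string. It is an illustration only, not how any particular protein language model is implemented: the embeddings are random (a trained model learns them), there are no learned query/key/value projections, no multiple heads, and no positional information, and the names `EMBEDDINGS`, `EMBED_DIM`, and `self_attention` are made up for this example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration

# Assumption: random embeddings stand in for the learned ones of a real model.
_rng = random.Random(0)
EMBEDDINGS = {
    aa: [_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for aa in AMINO_ACIDS
}


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity projections.

    Returns (weights, outputs): weights[i][j] is how strongly position i
    attends to position j; outputs[i] is the weighted mix of all embeddings.
    """
    vectors = [EMBEDDINGS[aa] for aa in sequence]
    scale = math.sqrt(EMBED_DIM)
    weights = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * vec[i] for w, vec in zip(row, vectors)) for i in range(EMBED_DIM)]
        for row in weights
    ]
    return weights, outputs


toy_sequence = "EVQLVESGG"  # toy fragment, used only as example input
weights, outputs = self_attention(toy_sequence)
for aa, row in zip(toy_sequence, weights):
    print(aa, [round(w, 2) for w in row])
```

Each printed row shows how one residue distributes its attention over every residue in the sequence, including distant ones, which is what lets these models capture long-range dependencies.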
AbLang is a protein language model trained specifically on antibodies, and it was readily available to me and easy to install at the time. There's a strong argument to be made for using a more general protein language model. At the least, one should compare the results here to something like ESM from Meta.
An antibody is a protective protein produced by the B cells in the immune system to neutralize and defend against harmful pathogens, such as viruses. Structurally, antibodies are composed of four polypeptide chains—two identical heavy chains and two identical light chains—linked by disulfide bonds, forming a Y-shaped molecule. The unique feature of antibodies lies in their variable region, located at the tips of the Y-shaped structure, which is responsible for binding specifically to the foreign substance, or antigen.
The variable region is generated stochastically through a process called V(D)J recombination. This process involves the pseudo-random selection, recombination, and joining of gene segments (Variable (V), Diversity (D), and Joining (J) segments), along with the introduction of pseudo-random nucleotide insertions and deletions between the recombined genes. This generates astronomical standing antibody sequence diversity, far more than could ever be encoded directly in the genome. Selection of particular antibodies from this diversity allows individuals to adapt their immune system to never-before-seen pathogens.
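The combinatorics described above can be sketched with a toy simulation. Everything here is hypothetical: the segment strings are invented rather than real germline genes, and the trimming and insertion lengths are arbitrary. The point is only that random segment choice plus junctional deletions and insertions yields far more distinct sequences than there are segment combinations.

```python
import random

# Hypothetical toy segments, NOT real germline gene sequences.
V_SEGMENTS = ["TGTGCGAGAGA", "TGTGCGAAAGA", "TGTGCGAGAGG"]
D_SEGMENTS = ["GGTATAGCAGC", "GTATTACTATG", "TACAGCTATGG"]
J_SEGMENTS = ["ACTACTTTGACTACTGG", "CTGGTTCGACCCCTGG"]


def trim(segment, rng, end, max_trim=3):
    """Delete 0..max_trim nucleotides from one end of a segment."""
    n = rng.randint(0, max_trim)
    if n == 0:
        return segment
    return segment[:-n] if end == "3prime" else segment[n:]


def random_insert(rng, max_len=4):
    """Add 0..max_len random nucleotides at a junction."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))


def recombine(rng):
    """Pick one V, D, and J segment and join them with junctional edits."""
    v = trim(rng.choice(V_SEGMENTS), rng, end="3prime")
    d = trim(trim(rng.choice(D_SEGMENTS), rng, end="5prime"), rng, end="3prime")
    j = trim(rng.choice(J_SEGMENTS), rng, end="5prime")
    return v + random_insert(rng) + d + random_insert(rng) + j


rng = random.Random(0)
junctions = {recombine(rng) for _ in range(1000)}
n_combinations = len(V_SEGMENTS) * len(D_SEGMENTS) * len(J_SEGMENTS)
print(f"{n_combinations} segment combinations, {len(junctions)} distinct junctions")
```

Even with only a handful of made-up segments, the junctional edits push the number of distinct sequences well past the number of segment combinations.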
The antibody generation process creates a large pool of standing diversity in protein space. Next, the immune system uses an evolutionary process to select and amplify the antibodies which have the most beneficial properties, such as strongly binding and neutralizing a pathogen which is wreaking havoc in your body. During the selection process, additional diversity is generated by mutating the selected antibodies, allowing the immune system to iterate toward an even better protein: it selects the best of the selected. This process is called somatic hypermutation, and it's pretty fascinating.
All you need to know to understand what I'm going to show in this notebook is that, generally speaking, more somatic hypermutation means the antibody in question has been through more of this evolutionary process than an antibody with no somatic hypermutation.
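As a rough illustration of what "more somatic hypermutation" means numerically, the sketch below counts mismatches between an observed sequence and a germline reference. It assumes the two sequences are already aligned and the same length; in practice that alignment comes from a dedicated germline-assignment tool. The sequences and the function names `mutation_count` and `mutation_frequency` are invented for this example and are not the labeling procedure used in this notebook.

```python
def mutation_count(observed, germline):
    """Count positions where an aligned sequence differs from its germline."""
    if len(observed) != len(germline):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(1 for o, g in zip(observed, germline) if o != g)


def mutation_frequency(observed, germline):
    """Fraction of aligned positions that differ from germline."""
    return mutation_count(observed, germline) / len(germline)


# Hypothetical toy sequences, not real antibody genes.
germline = "CAGGTGCAGCTGGTGCAGTCTGGG"
unmutated = "CAGGTGCAGCTGGTGCAGTCTGGG"
mutated = "CAGGTGCAACTGGTGCAGTCTGGA"

for name, seq in [("unmutated", unmutated), ("mutated", mutated)]:
    print(name, mutation_count(seq, germline), round(mutation_frequency(seq, germline), 3))
```

A count of zero corresponds to the "no mutation" label used later, while higher counts suggest more rounds of selection and mutation.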
[Figure: panel titled "Antibody Constant Region"; axis label "PC3"]
As an aside, we can see a fair number of "no mutation" antibody sequences (highlighted in the red box or seen in the tail of the orange distribution) that actually appear more like mutated antibodies in the PC space. I asked whether the gene expression of cells with that profile looked like that of Memory B cells, which would corroborate that the antibody actually is hypermutated. Indeed, these cells with "no mutations" appear to have a memory-like phenotype, suggesting the language model identifies mutations in antibodies more sensitively than my original labeling of mutation status, which was based on comparison to a genomic database.
[Figure: panel titled "Memory B cells"]
- the language model encodes antibody hypermutation on 2 relatively orthogonal axes
  - one is just whether or not there are any mutations
  - the other is how many mutations there are
- ad-hoc classification of antibody encoding types generally coheres with my orthogonal measurement of gene expression